Data Tidying
Humboldt-Universität zu Berlin
2023-06-06
Last week we learned how to…
include: false should have been message: false
fig-out: 6 should have been fig-width: 6
label formattingToday we will…
tidyverse, here
daten folder):
table1.csvdf_billboard.csv (neu)cms_patient_experience.csv (neu)readr::read_csv), transform (dplyr package), and visualise (ggplot package) dataImage source: Wickham et al. (o. J.) (all rights reserved)
country, year, popuplation, and number of tuberculosis cases
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
Three rules for tidy data:
Image source: Wickham et al. (o. J.) (all rights reserved)
“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy
“Tidy datasets are all alike, but every untidy dataset is untidy in its own way.” — Hadley Wickham
There are two main advantages to working with tidy data:
tidyverse are designed to work with tidy dataVectors
Vectors are the most basic data object type in R. A vector contains data of the same type, and is essentially a list. You can create a vector using the c() function, for example.
vector1 will contain numeric values, because all elements are numbers. vector2 will contain all character values (i.e., text), because there is a singlular unambiguous character element ("c"). So, R reads all elements as character type. We can create a dataframe from vectors of the same length using the tibble() function, for example.
Aufgabe 1: Tidy data
Beispiel 1
rate) that contains:
cases per country per year, divided bypopulation per country per year,10000
dplyr verb creates new variables? (Look back at week 5)rate with these data structures?tidyverse
tidyr package
pivot_longer(): make wide data longerpivot_wider(): make long data widerAbbildung 1: die berühmteste Verwendung des Wortes Pivot (zumindest für Millenials) (Friends)
df_billboard
billboard.csv dataset
artist, track, date_entered)wk1-wk76) that describe the rank of the song that week
week), and the cell values are another variable (the rank)| artist | track | date_entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | wk33 | wk34 | wk35 | wk36 | wk37 | wk38 | wk39 | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 87 | 82 | 72 | 77 | 87 | 94 | 99 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2Ge+her | The Hardest Part Of ... | 2000-09-02 | 91 | 87 | 92 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 3 Doors Down | Kryptonite | 2000-04-08 | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | 3 | 3 | 4 | 5 | 5 | 9 | 9 | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 3 Doors Down | Loser | 2000-10-21 | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 504 Boyz | Wobble Wobble | 2000-04-15 | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 98^0 | Give Me Just One Nig... | 2000-08-19 | 51 | 39 | 34 | 26 | 26 | 19 | 2 | 2 | 3 | 6 | 7 | 22 | 29 | 36 | 47 | 67 | 66 | 84 | 93 | 94 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
pivot_longer()pivot_longer()
| artist | track | date_entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk6 | 94 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk7 | 99 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk8 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk9 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk10 | NA |
col = specifies which columns need to be pivoted (i.e., which are not variables)
select(), so we could use e.g., starts_with("")
names_to = names the variable stored in the current column names, here it is week
values_to = names the variable stored int he cell values, which we name rank
week and rank with quotation marks because they aren’t variable names yet
NAs)NA (missing) values; this happened when a song was not in the top 100
values_drop_na = TRUE/FALSE argument drops pivoted rows don’t have a value for the new variable| artist | track | date_entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk6 | 94 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk7 | 99 |
| 2Ge+her | The Hardest Part Of ... | 2000-09-02 | wk1 | 91 |
| 2Ge+her | The Hardest Part Of ... | 2000-09-02 | wk2 | 87 |
| 2Ge+her | The Hardest Part Of ... | 2000-09-02 | wk3 | 92 |
week still contains wk in the valuesreadr package has a handy function: parse_number() extracts the first number from a string, ignoring all other text
parse_number() to alter the variable week?| artist | track | date_entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 6 | 94 |
tidyverse also has a package that makes working with dates easier: lubridate
date_entered has the format year-month-day (ymd)lubridate verbs to extract the year, month, and day
week data in one variable and the rank data in another variableAbbildung 2: Billboard ranks by number of weeks for songs that were in the top 100 in the year 2000
Aufgabe 1: Tidy data
Beispiel 2
df_cms
pivot_wider() make datasets wider by increasing columns and reducing rows
cms_patient_experience.csv dataset contains data about patient experiences fom the Centers of Medicare and Medicaid servicesorg_pac_id and org_nm), but each organisation (i.e., observation) takes up 6 rows
| org_pac_id | org_nm | measure_cd | measure_title | prf_rate |
|---|---|---|---|---|
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_1 | CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Information | 63 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_2 | CAHPS for MIPS SSM: How Well Providers Communicate | 87 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_3 | CAHPS for MIPS SSM: Patient's Rating of Provider | 86 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_5 | CAHPS for MIPS SSM: Health Promotion and Education | 57 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_8 | CAHPS for MIPS SSM: Courteous and Helpful Office Staff | 85 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_12 | CAHPS for MIPS SSM: Stewardship of Patient Resources | 24 |
pivot_wider()id_cols: identifying columnsnames_from: what should we call the new column containing the previous column names?names_prefix: prefix for the new column namesvalues_from: new column valuesorg,
measure_cd and create six new variable names from them,values from prf_rate?| org_pac_id | org_nm | CAHPS_GRP_1 | CAHPS_GRP_2 | CAHPS_GRP_3 | CAHPS_GRP_5 | CAHPS_GRP_8 | CAHPS_GRP_12 |
|---|---|---|---|---|---|---|---|
| 0446157747 | USC CARE MEDICAL GROUP INC | 63 | 87 | 86 | 57 | 85 | 24 |
| 0446162697 | ASSOCIATION OF UNIVERSITY PHYSICIANS | 59 | 85 | 83 | 63 | 88 | 22 |
| 0547164295 | BEAVER MEDICAL GROUP PC | 49 | NA | 75 | 44 | 73 | 12 |
| 0749333730 | CAPE PHYSICIANS ASSOCIATES PA | 67 | 84 | 85 | 65 | 82 | 24 |
| 0840104360 | ALLIANCE PHYSICIANS INC | 66 | 87 | 87 | 64 | 87 | 28 |
| 0840109864 | REX HOSPITAL INC | 73 | 87 | 84 | 67 | 91 | 30 |
| org_pac_id | org_nm | measure_cd | measure_title | prf_rate |
|---|---|---|---|---|
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_1 | CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Information | 63 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_2 | CAHPS for MIPS SSM: How Well Providers Communicate | 87 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_3 | CAHPS for MIPS SSM: Patient's Rating of Provider | 86 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_5 | CAHPS for MIPS SSM: Health Promotion and Education | 57 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_8 | CAHPS for MIPS SSM: Courteous and Helpful Office Staff | 85 |
| 0446157747 | USC CARE MEDICAL GROUP INC | CAHPS_GRP_12 | CAHPS for MIPS SSM: Stewardship of Patient Resources | 24 |
| org_pac_id | org_nm | CAHPS_GRP_1 | CAHPS_GRP_2 | CAHPS_GRP_3 | CAHPS_GRP_5 | CAHPS_GRP_8 | CAHPS_GRP_12 |
|---|---|---|---|---|---|---|---|
| 0446157747 | USC CARE MEDICAL GROUP INC | 63 | 87 | 86 | 57 | 85 | 24 |
| 0446162697 | ASSOCIATION OF UNIVERSITY PHYSICIANS | 59 | 85 | 83 | 63 | 88 | 22 |
| 0547164295 | BEAVER MEDICAL GROUP PC | 49 | NA | 75 | 44 | 73 | 12 |
| 0749333730 | CAPE PHYSICIANS ASSOCIATES PA | 67 | 84 | 85 | 65 | 82 | 24 |
| 0840104360 | ALLIANCE PHYSICIANS INC | 66 | 87 | 87 | 64 | 87 | 28 |
| 0840109864 | REX HOSPITAL INC | 73 | 87 | 84 | 67 | 91 | 30 |
An example: Abbildung 3 shows the trend of ‘Say My Name’ by Destiny’s child, which entered the charts on Dec. 25th, 1999 (date_entered) at number 83 (rank for wk1), and stayed on the top 100 for 32 weeks (max(week)).
Abbildung 3: Example linegraph for ‘Say My Name’ by Destiny’s Child
biondo_etal_2021_tidy.csv (subset of data from Biondo et al. (2022)) and save it as df_biondo
knitr::kable()) with a label and caption of the head() of the datart and tt are in one column:
names should go to a new variable called measure
values should go to a new variable ms (for milliseconds)df_biondo_long
knitr::kable()) with a label and captiondf_biondo_long so that the columns rt and tt are back in their own columns
id_cols should be subj and item
names should come from measure
values should come from ms (for milliseconds)df_biondo_wide
knitr::kable()) with a label and captiondf_biondo_wide should be the exact same as df_biondo
Heute haben wir…
dplyr verbs from week 3 ✅Hergestellt mit R version 4.3.0 (2023-04-21) (Already Tomorrow) und RStudioversion 2023.3.0.386 (Cherry Blossom).
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] here_1.0.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[5] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
[9] tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.3 generics_0.1.3 xml2_1.3.4
[4] stringi_1.7.12 hms_1.1.3 digest_0.6.31
[7] magrittr_2.0.3 evaluate_0.21 grid_4.3.0
[10] timechange_0.2.0 fastmap_1.1.1 rprojroot_2.0.3
[13] jsonlite_1.8.4 httr_1.4.6 rvest_1.0.3
[16] fansi_1.0.4 viridisLite_0.4.2 scales_1.2.1
[19] cli_3.6.1 rlang_1.1.1 crayon_1.5.2
[22] bit64_4.0.5 munsell_0.5.0 withr_2.5.0
[25] yaml_2.3.7 tools_4.3.0 parallel_4.3.0
[28] tzdb_0.4.0 colorspace_2.1-0 webshot_0.5.4
[31] pacman_0.5.1 kableExtra_1.3.4.9000 png_0.1-8
[34] vctrs_0.6.2 R6_2.5.1 lifecycle_1.0.3
[37] magick_2.7.4 bit_4.0.5 vroom_1.6.3
[40] pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.3
[43] glue_1.6.2 Rcpp_1.0.10 systemfonts_1.0.4
[46] highr_0.10 xfun_0.39 tidyselect_1.2.0
[49] rstudioapi_0.14 knitr_1.42 farver_2.1.1
[52] htmltools_0.5.5 labeling_0.4.2 svglite_2.1.1
[55] rmarkdown_2.21 compiler_4.3.0
Woche 8 - Data Tidying